Today, powerful microcomputers are not only beginning to effect a redesign of the 
structure and content of school curricula and the entire process of instruction and 
learning; they are also having a decided impact on the types of tests created and used 
to assess that learning. In fact, computerized testing is increasingly being viewed as a 
practical alternative to paper-and-pencil testing (Kingsbury & Houser, 1993). Tests 
administered at computer terminals or on personal computers are known as 
computerized tests. Given the advantages of individual, time-independent language 
testing, computer-based testing will no doubt prove to be a positive development in 
assessment practice (Brown, 1997, p. 46). 

COMPUTER-ADAPTIVE TESTING AND SECOND LANGUAGE ASSESSMENT 



Second language (L2) computer-adaptive testing (CAT) is a technologically advanced 
method of assessment in which the computer selects and presents test items to 
examinees according to the estimated level of the examinee's language ability. The 
basic notion of an adaptive test is to mimic automatically what a wise examiner would 
normally do. Specifically, if an examiner asked a question that turned out to be too 
difficult for the examinee, the next question asked would be considerably easier. This 
approach stems from the realization that we learn little about an individual's ability if we 
persist in asking questions that are far too difficult or far too easy for that person. We 
learn the most about an examinee's ability when we accurately direct our questions at 
the current level of the examinee's ability (Wainer, 1990, p. 10). 

Thus, in a CAT, the first item is usually of a medium-difficulty level for the test 
population. An examinee who responds correctly will then receive a more difficult item. 
An examinee who misses the first item will be given an easier question. And so it goes, 
with the computer algorithm adjusting the selection of the items interactively to the 
successful or failed responses of the test taker. 
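
The harder-after-correct, easier-after-incorrect pattern described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 1-to-9 difficulty scale, the fixed step of one level, and the function names are hypothetical and do not come from any particular CAT system. Operational CATs typically select items from a running ability estimate rather than a fixed step.

```python
def next_difficulty(current, was_correct, step=1, lowest=1, highest=9):
    """Move one step harder after a correct answer, one step easier after a miss.

    The 1-9 scale and the step size are assumptions made for illustration.
    """
    proposed = current + step if was_correct else current - step
    # Keep the difficulty inside the range covered by the (hypothetical) item pool.
    return max(lowest, min(highest, proposed))


def run_adaptive_sequence(responses, start=5):
    """Return the difficulty level presented before each response in turn.

    `responses` is a sequence of booleans (True = correct answer).
    The first item is of medium difficulty (`start`).
    """
    presented = []
    level = start
    for was_correct in responses:
        presented.append(level)
        level = next_difficulty(level, was_correct)
    return presented


# Two correct answers, a miss, then a correct answer.
print(run_adaptive_sequence([True, True, False, True]))  # prints [5, 6, 7, 6]
```
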

ADVANTAGES OF USING COMPUTER-ADAPTIVE TESTING IN SECOND LANGUAGE ASSESSMENT 

In a CAT, each examinee takes a unique test that is tailored to his or her 
ability level. Questions that provide little information about the test taker's 
proficiency are avoided. The result of this approach is higher precision across a wider 
range of ability levels (Carlson, 1994, p. 218). In fact, CAT was developed to replace 
the traditional, time-consuming, and inefficient test that presents easy questions to 
high-ability examinees and excessively difficult questions to low-ability examinees. Other 
advantages of CAT include the following: 



* "Self-Pacing." CAT allows test takers to work at their own pace. The speed of 
examinee responses could be used as additional information in assessing proficiency, if 
desired and warranted. 



* "Challenge." Test takers are challenged by test items at an appropriate level; they are 
not discouraged or annoyed by items that are far above or below their ability level. 



* "Immediate Feedback." The test can be scored immediately, providing instantaneous 
feedback for the examinees. 



* "Improved Test Security." The computer contains the entire item pool, rather than 
merely those specific items that will make up the examinee's test. As a result, it is more 
difficult to artificially boost one's scores by merely learning a few items or even types of 
items (Wainer, 1990). However, in order to achieve improved security, the item pool 
must be sufficiently large to ensure that test items do not reappear with a frequency 
sufficient to allow examinees to memorize them. 



* "Multimedia Presentation." Tests can include text, graphics, photographs, and even 
full-motion video clips, although multimedia CAT development is still in its infancy. 

Individual test takers are not the only ones who can benefit from CAT. Green et al. 
(1995) point out that computerized testing could benefit a variety of agencies and 
groups, such as those described below: 

* Educators considering the use of a published or in-house CAT to assess student 
achievement in large-enrollment second language classrooms or programs. 



* Licensing boards needing to develop a CAT to help them identify candidates who 
meet specific performance standards for licensure. One such CAT is the Occupational 
English Test (OET) developed on behalf of the Australian Government (McNamara, 
1991). 



* Agencies preparing user guides for their computer-adaptive achievement tests, such 
as ETS's Graduate Record Examination. 



* Departments of education wishing to develop a CAT version of statewide minimum 
competency tests. 



* Departments of modern foreign languages wanting to create a proficiency CAT for 
entrance into or exit from required language courses, such as the Ohio State 
University's Multimedia Computer-Adaptive Test (MultiCAT) of French, German, and 
Spanish. 

COMPUTER-ADAPTIVE TESTING: ROOTS AND CHALLENGES 



In the 1960s and 1970s, the U.S. Department of Defense perceived the potential 
benefits of adaptive testing and supported extensive theoretical research in CAT and 
Item Response Theory (IRT), the family of psychometric models underlying 
computer-adaptive testing (Wainer, 1990, p. 10). IRT is based on probabilistic theory; 
that is, it calculates the probability of a given person getting a particular item right 
(Alderson, Clapham, & Wall, 1995). Examinees' scores and item total statistics are 
transformed into one scale so that they can be related to each other. If a person's ability 
is the same as the difficulty level of the item, that person has a 50-50 chance of getting 
that item right. If their ability is at a lower level than that of the item, that probability 
decreases. The relationship between the examinee's item performance and the abilities 
underlying item performance is described in an item characteristic curve (ICC). As the 
level of students' ability increases, so does the probability of a correct response (see 
Alderson, Clapham, & Wall, 1995, p. 90). 
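
The relationship between ability, item difficulty, and the probability of a correct response can be illustrated with a short Python sketch. It assumes the one-parameter (Rasch) logistic model, which is only one member of the IRT family and is chosen here because it is the simplest model in which ability equal to difficulty gives exactly a 50-50 chance. The source does not say which model a given CAT uses, and the function names are hypothetical. The second function computes the item's information under the same assumption, which is largest when the item's difficulty matches the examinee's ability.

```python
import math


def probability_correct(ability, difficulty):
    """Item characteristic curve under the Rasch (one-parameter logistic) model.

    Ability and difficulty are assumed to lie on the same logit scale.
    """
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def item_information(ability, difficulty):
    """Information a Rasch item provides at a given ability: p * (1 - p)."""
    p = probability_correct(ability, difficulty)
    return p * (1.0 - p)


# Ability equal to difficulty gives a 50-50 chance of a correct response.
print(probability_correct(0.0, 0.0))  # prints 0.5

# The probability rises with ability and falls when ability is below difficulty.
print(probability_correct(1.0, 0.0) > 0.5 > probability_correct(-1.0, 0.0))  # prints True

# A well-targeted item is more informative than one that is far too easy.
print(item_information(0.0, 0.0) > item_information(2.0, 0.0))  # prints True
```
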






ED435202 1999-10-00 Considerations in Developing and Using Computer-Adaptive 
Tests To Assess Second Language Proficiency. ERIC Digest. 



ERIC Resource Center 



www.eric.ed.gov 



Early attempts to build adaptive tests by the U.S. Army, Navy, and Air Force were often 
less than successful and very expensive, and they relied on large-scale computers. However, by the 
early 1980s, personal computers had acquired the power of the large-scale computers 
of earlier years, and the pioneering efforts of IRT theorists had perfected the 
psychometric model underlying CAT. In the late 1980s, CAT finally moved out of the 
realm of theory and supposition into the sphere of possibility and implementation with 
the advent of the Educational Testing Service's CAT Graduate Record Examination and with the work 
of in-house researchers in foreign language education at the Defense Language 
Institute and at universities throughout the United States, Britain, the Netherlands, and 
other countries. 

Today, with software development companies assisting test developers with their own 
institutional L2 CATs, computer-adaptive testing has finally become a viable alternative 
to conventional paper-and-pencil testing. Commercial CAT programs such as those 
offered by the Assessment Systems Corporation (St. Paul, Minnesota) and 
Computer-Adaptive Technologies (Chicago, Illinois) make it easier for developers to 
create L2 CATs using software templates rather than having to start programming and 
development from scratch. It is anticipated that in the future, more and more commercial 
companies and academic institutions will be producing testing shells that can be used to 
create CATs for placement, achievement, and licensing purposes. 

ISSUES INVOLVING THE BASIC PRINCIPLES OF ASSESSMENT IN COMPUTER-ADAPTIVE TESTING 

A number of questions need to be addressed when considering the basic 
principles of assessment in computer-adaptive testing. 

"Is the Computerized Testing System Appropriate for the Purpose of the Test?" 

L2 CAT developers need to clearly identify and specify the assessment purpose of their 
tests. This is important because CATs can be used for a wide variety of purposes, 
including the following: 



* Identifying whether an individual has met the specific objectives of a basic language or 
literature course. 



* Indicating an individual's level of achievement in a skill domain (e.g., listening 
comprehension or grammar knowledge). 



* Identifying specific areas in which a student needs additional educational experiences 
(e.g., knowledge and use of specific grammatical points or recognition of specific idioms 
and vocabulary items). 



* Diagnosing an individual's skill-area strengths (e.g., the ability to recognize main ideas 
presented in a spoken mini-lecture) and weaknesses (e.g., inability to recall specific 
details from a short conversation about an academic topic). 



* Detecting whether candidates have met minimum course requirements as demonstrated 
in a mastery test. 

In addition to clearly stating the purposes of the test, CAT developers must ensure that 
the CAT is able to measure the examinee's true proficiency level (Green et al., 1995). 

To achieve this goal, an L2 CAT must provide examinees with a sufficiently broad range 
of L2 content areas and skill tasks to ensure that their true proficiency is indeed 
measured in the test items taken. Because examinees may be of high or low proficiency 
levels, the CAT must be designed in such a way as to provide adequate assessment for 
the entire range of ability represented in the examinee population (Green et al., 1995, p. 
2). This objective may most easily be accomplished by obtaining or designing a CAT 
that includes the entire range of ability in its item pool. For example, in the case of a 
general listening proficiency CAT, items in the pool must cover low to high listening 
ability levels. In addition, the items need to include a variety of listening tasks, such as 
comprehension of the main ideas of a conversation or mini-lecture, recognition and 
recall of details of a conversation, identification of specific words and phrases used in a 
passage, and so forth. 

To achieve both objectives, the item selection algorithm must constrain the selection of 
items not only by the statistical parameters associated with each test item (such as its 
difficulty level) but also by task type, so that examinees are presented with a variety of 
designated listening comprehension tasks. 
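
One way to picture such a constrained selection rule is the Python sketch below. It is a minimal illustration under stated assumptions, not a description of any published CAT: the item records, the task labels, the per-task target counts, and the rule of choosing the unused item closest in difficulty to the current ability estimate are all hypothetical.

```python
# Hypothetical item pool: each item has an id, a listening task type,
# and a difficulty value on the same scale as the ability estimate.
SAMPLE_POOL = [
    {"id": "a", "task": "main_idea", "difficulty": 0.1},
    {"id": "b", "task": "detail", "difficulty": 0.0},
    {"id": "c", "task": "detail", "difficulty": 1.2},
]


def select_next_item(pool, ability_estimate, administered_ids, task_counts, task_targets):
    """Pick the unused item closest in difficulty to the ability estimate,
    considering only task types that have not yet reached their target count.

    Returns None when no eligible item remains.
    """
    open_tasks = {
        task for task, target in task_targets.items()
        if task_counts.get(task, 0) < target
    }
    candidates = [
        item for item in pool
        if item["id"] not in administered_ids and item["task"] in open_tasks
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda item: abs(item["difficulty"] - ability_estimate))


targets = {"main_idea": 1, "detail": 1}

# With nothing administered yet, the best statistical match is chosen.
first = select_next_item(SAMPLE_POOL, 0.0, set(), {}, targets)
print(first["id"])  # prints b

# Once the "detail" quota is met, only a "main_idea" item is eligible.
second = select_next_item(SAMPLE_POOL, 0.0, {"b"}, {"detail": 1}, targets)
print(second["id"])  # prints a
```

In this sketch the content constraint is applied first and the statistical match second, so a well-targeted item can be passed over when its task type has already been covered.
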

"Does the CAT Embody the Basic Principles of Reliability?" 

Reliability refers to the precision and consistency of scores derived from a test 
instrument. It is a function of general, situational, and individual factors (Cohen, 1994) 
that can be used to frame evaluative questions for the developers of the test. General 
factors influencing reliability include, for example, whether instructions for the 
examinees are clear and explicit, or whether the examinees are sufficiently familiar with 
the format of the CAT before taking it. Situational factors include those related to the 
testing environment, such as noise level or whether headphones are provided. 

Individual factors include transient factors, such as the physical and psychological 
health of the test takers, and stable factors, such as examinees' experience with similar 
tests. 

"Does the CAT Embody the Basic Principles of Validity?" 

Validity refers to whether a test actually measures what it purports to measure. It relates 
to the appropriacy of the inferences made on the basis of the test scores. There are 
several aspects of validity: content, construct, criterion, concurrent, and predictive. CAT 
developers and users need to examine issues related to each of these types of validity. 

"Do the Examinees Have an Opportunity to Become Familiar with the Computer, the 
CAT System, and the Structure, Organization, and Content Domains of the CAT?" 

Examinees should be given the time and opportunity to become thoroughly familiar with 
both the computer and the testing system. For first-time computer users, there should 
be an orientation to the functioning of the computer (e.g., using a mouse, calling for 
questions, answering questions, adjusting the audio volume, scrolling, etc.). An 
orientation to the structure and types of items they will encounter during the CAT should 
also be required for all examinees. The practice items should be equivalent in structure 
and content to those contained in the item bank. 

"Is the Item Pool of an Appropriate Quality to Support the Test Purpose(s) and to 
Measure the Identified Ability of the Examinee Population?" 

The depth and breadth of the item pool from which individual items are drawn strongly 
affect the validity and utility of the resulting CAT scores. Because of this, in addition to 
ensuring that the items tap the variety of specific tasks and content areas pertinent to 
the identified purpose of the CAT, the developers and users of the scores need to be 
able to specify exactly what the items in the bank assess. For instance, in an academic 
listening proficiency CAT, the designers could specify that all examinees must 
demonstrate comprehension of the main ideas of a mini-lecture and comprehension of 
the details of a short dialog. They may also wish to set other specific skills for certain 
ability levels. For instance, advanced listeners should be able to understand the implied 
meaning of utterances. 

CONCLUSION 



Computer-adaptive testing shows promise of becoming a regular component of 
standardized foreign language assessment in the coming century, particularly for 
licensing and certification purposes. Many benefits accrue to examinees and 
administrators alike when using CAT. However, to reap these benefits, numerous 
checks and balances need to be put into place so that the potential pitfalls in the 
development and uninformed use of CAT are avoided. Developers and users alike need 
to understand fully what a CAT is and how it operates. They also need to be aware of 
what the underlying psychometric model used in their CAT posits in terms of the 
unidimensional or multidimensional IRT model selected. They need to understand what 
the selected IRT model means in terms of the dimensionality of the content and tasks 
associated with the items. They need to be familiar with how the IRT statistical 
parameters of the test items are estimated after their trialing. Above all, they must know 
what is necessary to implement a valid and reliable CAT. 